Part 1 : EDA

First, we import the data and take a glance at what it looks like.

#packages used throughout this analysis (inferred from the functions called below)
library(tidyr)                   #drop_na, %>%
library(ggplot2)
library(ggmap)                   #get_stamenmap, ggmap
library(lubridate)               #year()
library(corrplot)                #corrplot
library(PerformanceAnalytics)    #chart.Correlation
library(car)                     #scatterplot
library(jsonlite)                #fromJSON
library(broom)                   #tidy
#as.Set used below comes from an external set-utilities package

df <- read.csv(file = 'lyon_housing.csv')

head(df,10)
date_transaction type_purchase type_property rooms_count surface_housing surface_effective_usable surface_terrain parkings_count price address district latitude longitude date_construction
19/10/31 ancien maison 5 100 NA 247 0 530000 6 PAS DES ANTONINS Villeurbanne 45.78167 4.879333 03/06/11 11:38
18/11/26 ancien maison 2 52 NA 156 0 328550 12 RUE DU LUIZET Villeurbanne 45.78324 4.884683 03/06/11 11:38
16/08/04 ancien appartement 1 28 28.20 0 1 42500 4 RUE DE L ESPOIR Villeurbanne 45.78149 4.883474 03/06/11 11:38
16/11/18 ancien appartement 3 67 66.30 0 1 180900 6 RUE DE L ESPOIR Villeurbanne 45.78149 4.883474 03/06/11 11:38
16/12/16 ancien appartement 1 28 NA 0 1 97000 163 AV ROGER SALENGRO Villeurbanne 45.78149 4.883474 03/06/11 11:38
17/01/06 ancien appartement 2 42 NA 0 1 112000 163 AV ROGER SALENGRO Villeurbanne 45.78149 4.883474 03/06/11 11:38
17/02/01 ancien appartement 1 28 NA 0 1 93000 4 RUE DE L ESPOIR Villeurbanne 45.78149 4.883474 03/06/11 11:38
17/02/27 ancien appartement 4 84 83.60 0 1 152000 34 RUE DU LUIZET Villeurbanne 45.78149 4.883474 03/06/11 11:38
17/02/22 ancien appartement 1 28 28.30 0 1 83000 4 RUE DE L ESPOIR Villeurbanne 45.78149 4.883474 03/06/11 11:38
17/03/07 ancien appartement 2 42 42.01 0 1 115400 163 AV ROGER SALENGRO Villeurbanne 45.78149 4.883474 03/06/11 11:38

Empty Cells

To find out how much of the data is usable, the first thing we must do is count how many cells are empty or NA. These are what we’ll refer to as missing values later.

print(apply(is.na(df),2,sum))
##         date_transaction            type_purchase            type_property 
##                        0                        0                        0 
##              rooms_count          surface_housing surface_effective_usable 
##                        0                        0                    14491 
##          surface_terrain           parkings_count                    price 
##                        0                        0                        0 
##                  address                 district                 latitude 
##                        0                        0                      143 
##                longitude        date_construction 
##                      143                        0

As we can see, the following columns have missing values:

surface_effective_usable: 14491

latitude: 143

longitude: 143

Since the counts are exactly equal, I wonder whether the missing latitudes and longitudes occur in the same rows.

lat_miss = which(is.na(df$latitude))

long_miss = which(is.na(df$longitude)) 

all(lat_miss == long_miss)
## [1] TRUE

Thus, exactly 143 rows are missing both coordinates; every other row has both latitude and longitude.

Map the area covered

Now I want to draw a map to see what area the dataset actually covers, so I remove the rows with missing coordinates and then plot. I’m also gonna add a color gradient on the price column to see whether prices are noticeably higher in some areas.

#remove all rows with a Na value
#df_noNa <- na.omit(df)



df <- df %>% drop_na(longitude, latitude)


#first draw this to check if the map has proper zoom
sp = ggplot(df, aes(x = longitude, y = latitude, colour = price , shape =type_property)) + geom_point()

show(sp)

#set borders of the map 

# border <- c(
#   left = min(df$longitude)-(min(df$longitude)*0.1),
#   bottom = min(df$latitude)-( min(df$latitude)*0.1),
#   right = max(df$longitude)+(max(df$longitude)*0.1),
#   top = max(df$latitude)+(max(df$latitude)*0.1)
# )


border <- c(
  left = min(df$longitude),
  bottom = min(df$latitude),
  right = max(df$longitude),
  top = max(df$latitude)
)

#get map image

map <- get_stamenmap(
  bbox = border,
  zoom = 10
)
## Source : http://tile.stamen.com/terrain/10/525/365.png
ggmap(map)+
geom_point(
    data = df,
    mapping = aes(x = longitude, y = latitude, color = price, shape = type_property)
    , size = 1.5
  )+ scale_color_gradientn(colours=c("darkorchid1","red"))+scale_shape_manual(values = c(4,15))

It seems like most of the reddish points are squares (houses), so later we’ll check whether houses are generally more expensive than apartments.

Add price/area column

The area of a house affects its price. To compare prices fairly, it is useful to add a column for price per \(m^2\).

#use column names instead of positional indices
price = df$price
area = df$surface_housing
ppm = price / area

df['price_per_meter'] <- ppm

Find out value ranges/types of each column

For further analysis we need to know what values each column takes.

for (column in colnames(df) ){
  print(as.Set(df[column]))
}
## $date_transaction
## {16/07/01, 16/07/04,...,21/06/29, 21/06/30} 
## 
## $type_purchase
## {ancien, VEFA} 
## 
## $type_property
## {appartement, maison} 
## 
## $rooms_count
## {1, 2,...,5, 6} 
## 
## $surface_housing
## {100, 101,...,98, 99} 
## 
## $surface_effective_usable
## {100, 100.03,...,99.91, NA} 
## 
## $surface_terrain
## {0, 102,...,96, 98} 
## 
## $parkings_count
## {0, 1, 2, 3} 
## 
## $price
## {100001, 1001000,...,99995, 9e+05} 
## 
## $address
## {, 1 ALL ATHENA,...,99 RUE PIERRE CORNEILLE, 99 RUE TETE D'OR} 
## 
## $district
## {Lyon 1er Arrondissement, Lyon 2e Arrondissement,...,Lyon 9e Arrondissement, Villeurbanne} 
## 
## $latitude
## {45.722048, 45.722102,...,45.805026, 45.805458} 
## 
## $longitude
## {4.773162, 4.773826,...,4.919765, 4.920161} 
## 
## $date_construction
## {00/01/08 11:38, 00/01/13 11:38,...,99/12/01 11:38, 99/12/17 11:38} 
## 
## $price_per_meter
## {1000, 10000,...,9878.0487804878, 9913.4345}

Two columns contain dates, so we reformat those values to make later calculations easier. Since the hour of construction means nothing, we drop the time part. Besides, the time seems to take only two values; let’s check:

modified = strptime(df[['date_construction']], "%Y/%m/%d %H:%M")
time = format(modified, format = "%H:%M:%S")
print(as.Set(time))
## {00:00:00, 11:38:00}
#DO NOT run these lines more than once

df$date_construction = strptime(df[['date_construction']], "%Y/%m/%d")
df$date_construction = as.Date(df$date_construction, format= "%Y-%m-%d")
df$date_transaction = strptime(df[['date_transaction']], "%Y/%m/%d")
df$date_transaction = as.Date(df$date_transaction, format= "%Y-%m-%d")

Normally, if the construction date lies in the future and the sale type isn’t a sale before completion (VEFA), the row is corrupted and should be removed. However, we know the values in this column were generated by adding 2 years to the date the house’s plan was produced, so we don’t shift them back and just let them be. A constant 2-year shift has no serious effect on our analysis.

now =as.Date(Sys.Date(), format= "%Y-%m-%d")

year(now)<-year(now)-2000

#this is rows that have messed up values! 
future = subset(df, date_construction > now & date_construction < "0090-01-01" & type_purchase != "VEFA" )


#remove
#df <- df[(df$date_construction > now & df$date_construction < "0090-01-01" & df$type_purchase != "VEFA"),]

For better comprehension of price range and property type, let’s draw two charts:

barplot(table(df$type_property),col="purple",horiz = T)

  # plot the histogram
  hist(df$price, main = paste('Prices Distribution'), xlab = 'Price', ylab = 'Count', col="purple",probability = T )
  
  # overlay kernel density estimates on the histogram
  lines(density(df$price), col="magenta", lwd=3) # add a density estimate with defaults
  lines(density(df$price, adjust=2), lty="dashed", col="darkblue", lwd=3) 
  

  
# Adding a legend
legend("topright", c("fit normal", "real curve" ),
       lty = c(2, 1), col = c("darkblue","magenta"), box.lty = 4, lwd = 3)

Correlation between columns

Earlier we saw correlations between surface, effective surface and room count. Instead of computing correlations pair by pair, we use the code below to get them for all columns at once. I remove the geographical coordinates since they don’t provide much information when treated as plain numbers. I use use = “pairwise.complete.obs” because we have a fair amount of missing data, and this option computes each correlation from the rows that are complete for that particular pair of columns.
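To make the pairwise.complete.obs behavior concrete, here is a toy example (unrelated to the housing data) contrasting it with complete.obs:

```r
#toy matrix with one NA to show the difference between the two options
m <- cbind(a = c(1, 2, 3, 4),
           b = c(2, 4, 6, NA),
           c = c(1, 3, 2, 4))

#each pair of columns keeps only the rows complete for that pair
r_pair <- cor(m, use = "pairwise.complete.obs")
#every pair is restricted to rows where ALL columns are complete
r_comp <- cor(m, use = "complete.obs")

r_pair["a", "c"]   #0.8 -- computed on all 4 rows
r_comp["a", "c"]   #0.5 -- computed on the 3 complete rows only
```

So pairwise.complete.obs keeps as much data as possible for each individual correlation instead of discarding every row that has any NA.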

num_cols =unlist(lapply(df, is.numeric))  
cor_df = df[,num_cols]

cor_df = cor_df[,-which(names(cor_df) %in% c("latitude","longitude"))]


res <-cor(cor_df, use = "pairwise.complete.obs")

corrplot(res, type = "upper", order = "hclust", 
         tl.col = "black", tl.srt = 45)

As we can see, effective surface and surface have a high correlation.

Before plotting the rest of the columns, we remove missing values and try to remove some columns that don’t provide us fresh information.

#?cor

df_noNa =na.omit(df[,c('surface_housing','surface_effective_usable')])

cor_effective_surface = cor(df_noNa$surface_housing, df_noNa$surface_effective_usable)

cor_room = cor(df$surface_housing, df$rooms_count)

print(paste("correlation of surface and effective surface = ",round(cor_effective_surface,4)))
## [1] "correlation of surface and effective surface =  0.9625"
print(paste("correlation of surface and rooms count = ",round(cor_room,4)))
## [1] "correlation of surface and rooms count =  0.8384"

Both room count and effective surface are related to housing surface, with effective surface showing the larger correlation. Since effective surface and housing surface almost represent the same thing, we remove the surface_effective_usable column. It already contains 14491 missing values, and because surface_housing is rounded down to an integer, comparing the two is misleading: surface_effective_usable sometimes exceeds surface_housing by a fraction, simply because the fractional part of surface_housing was truncated.

df = df[,-which(names(df) %in% c("surface_effective_usable"))]

So lets draw correlation matrix once again and take a look at relation between columns:

#separate numeric columns
my_data <- df[, c(4:8)]
chart.Correlation(my_data, histogram=TRUE, pch=19)

num_cols =unlist(lapply(df, is.numeric))  
cor_df = df[,num_cols]

cor_df = cor_df[,-which(names(cor_df) %in% c("latitude","longitude"))]


res <-cor(cor_df, use = "pairwise.complete.obs")

corrplot(res, type = "upper", order = "hclust", 
         tl.col = "black", tl.srt = 45)

Plotting the rest of columns’ distribution

Now we draw diagrams for the rest of the columns

hist(df$parkings_count, xlab = 'Number of Parking spots', col = 'darkorchid1',  main = 'Parking Count Histogram')

hist.1 = hist(df$rooms_count, main = 'Number of Rooms', xlab = 'number', col = 'magenta3')

#hist.trimmed = TrimHistogram(hist.1)

hist(df$surface_housing, main = 'Surface Housing', xlab = 'Area', col = 'darkorchid3')

# lines(density(df$surface_housing), col="magenta", lwd=3) # add a density estimate with defaults
# lines(density(df$surface_housing, adjust=2), lty="dashed", col="darkblue", lwd=3) 
# 

barplot(table(df$type_purchase), col = "mediumorchid")

barplot(table(df$district), col = "mediumorchid3")

Now we plot a bar plot of the room counts

barplot(table(df$rooms_count), col = "mediumorchid4", main = 'room count barplot')

This chart is good for spotting outliers and ranges but shows nothing about the relation between the number of rooms and area. We’ll discuss that in the next parts.

Outliers

Univariate approach

For a given continuous variable, outliers are those observations that lie more than 1.5 * IQR below the first quartile or above the third quartile, where the IQR, the ‘Inter-Quartile Range’, is the difference between the 75th and 25th percentiles. Look at the points outside the whiskers in the box plots below.
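The 1.5 × IQR rule can be checked by hand; this sketch reproduces the bounds on synthetic data (not the housing prices themselves):

```r
set.seed(1)

#synthetic sample with one obvious outlier appended
x <- c(rnorm(100, mean = 50, sd = 5), 200)

q     <- quantile(x, probs = c(0.25, 0.75))
iqr   <- q[[2]] - q[[1]]            #same value as IQR(x)
lower <- q[[1]] - 1.5 * iqr
upper <- q[[2]] + 1.5 * iqr

#observations outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
manual_out <- x[x < lower | x > upper]
```

Note that boxplot.stats uses Tukey’s hinges rather than quantile(), so on small samples the two can occasionally disagree about borderline points.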

outlier_values <- boxplot.stats(df$price)$out  # outlier values.

boxplot(df$price, main="Price")

boxplot(df$surface_housing, main="Surface Housing")

We can also detect outliers for different ranges:

For prices, we analyze them based on two variables: type of property and area

boxplot(price ~ surface_housing, data=df, main="Boxplot for price vs surface", col = 'darkorchid1')

boxplot(price ~ cut(surface_housing, pretty(df$surface_housing)), data=df, main="Boxplot for price (categorical) vs surface", cex.axis=0.5, col = "magenta4",xlab = 'Area')

boxplot(price ~ type_property, data = df, col = 'mediumorchid1'
)

We can do the same for rooms count

boxplot(rooms_count ~ surface_housing, data=df, col = 'purple',main="Boxplot for rooms count vs surface")

boxplot(rooms_count ~ cut(surface_housing, pretty(df$surface_housing)), xlab = 'Area',data=df, col = 'purple', main="Boxplot for rooms count (categorical) vs surface", cex.axis=0.5)

This is where the plots get interesting! Except for a few outliers, the number of rooms increases almost linearly with surface; after a point, however, the number of rooms stays constant while surface keeps increasing. It seems that above a certain size, people prefer larger rooms over more rooms.

Adding subway station file

We import the .json file and load it into a data frame, then find the nearest station to each house. We’ll add a new column for this value, which will help us in part 3.

The station file contains station names, latitude and longitude for each.

json_file  <- fromJSON('station_coordinates.json')



#unlist the json file
json_file <- lapply(json_file, function(x) {
  x[sapply(x, is.null)] <- NA
  unlist(x)
})


stations = c()
lat = c()
long = c()

#iterate through json list and store data in vectors

for (i in (1:length(json_file))){
  atomic_vec = json_file[[i]]
n=length(atomic_vec)/3

for (j in (1:n)){
  stations = c(stations, atomic_vec[[j]])
}

for (j in ((n+1):(2*n))){
  lat = c(lat, atomic_vec[[j]])
}
for (j in ((2*n+1):(3*n))){
  long = c(long, atomic_vec[[j]])
}

}


station_df <- data.frame(stations, lat, long)

head(station_df)
stations lat long
Ampère - Victor Hugo 45.7530516 4.8291603
Cordeliers 45.7633942 4.8358228
Cusset 45.7656434 4.9008478
Flachet - Alain Gilles 45.7677765 4.8893766
Foch 45.7687864 4.8443036
Gratte-Ciel 45.7690539 4.8823107

To calculate distances, we use spherical geometry on the geographic coordinates. (I studied for the Astronomy Olympiad :) )

a typical spherical triangle

for side \(c\) in the picture above (with \(C\) the angle opposite it) we have:

\[\cos(c) = \cos(a)\cos(b)+ \sin(a)\sin(b)\cos(C) \] To use this formula in our problem, place the vertex \(u\) at the north pole:

a typical spherical triangle

Thus the formula becomes like this:

\[\cos(\theta) =\sin(\phi_a)\sin(\phi_b)+ \cos(\phi_a)\cos(\phi_b)\cos(\Delta \lambda) \] where \(\phi\) is latitude and \(\lambda\) is longitude. To get the distance in meters:

\[d = \theta^{rad}\times R_{\oplus} \] where \(R_{\oplus} = 6378000\) m is the equatorial radius of the Earth.

#convert degrees to radians
to_rad <- function(x){
  return (x*pi/180)
}

measure_distance <- function(phi1,phi2,l1,l2) {
  phi1 = to_rad(phi1)
  phi2 = to_rad(phi2)
  l1 = to_rad(l1)
  l2 = to_rad(l2)
  
  delta = abs(l1-l2)
  
  theta = sin(phi1)*sin(phi2)+cos(phi1)*cos(phi2)*cos(delta)
  d = acos(theta)*6378000
  
  return (d)
}
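A quick sanity check of the formula: two points on the equator, one degree of longitude apart, should be about \(\pi/180 \times 6378000 \approx 111{,}317\) m apart. The functions are restated here so the snippet runs standalone, with a pmin/pmax clamp added to guard acos against floating-point values just outside [-1, 1]:

```r
to_rad <- function(x) x * pi / 180

measure_distance <- function(phi1, phi2, l1, l2) {
  phi1 <- to_rad(phi1); phi2 <- to_rad(phi2)
  delta <- to_rad(abs(l1 - l2))
  theta <- sin(phi1) * sin(phi2) + cos(phi1) * cos(phi2) * cos(delta)
  acos(pmin(pmax(theta, -1), 1)) * 6378000   #clamp protects acos from rounding
}

#one degree of longitude along the equator: arc length = (pi/180) * R
d <- measure_distance(0, 0, 0, 1)
```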

min_distance <- function(x, y) {
  all_d = c()
  for (i in (1:nrow(station_df))){
    all_d = c(all_d, measure_distance(x,as.numeric(station_df$lat[i]),y,as.numeric(station_df$long[i])))
  }
    
  return(min(all_d))
}


closest <- function(x, y) {
  all_d = c()
  for (i in (1:nrow(station_df))){
    all_d = c(all_d, measure_distance(x,as.numeric(station_df$lat[i]),y,as.numeric(station_df$long[i])))
  }
  i <- which.min(all_d)
  return(station_df$stations[i])
}

Now we add two columns: the distance to the closest station, and the name of that station.

new_col = c()
for (i in 1 : nrow(df)){
  
  new_col = c(new_col, min_distance(as.numeric(df[['latitude']][i]), as.numeric(df[['longitude']][i])))
  
}

df['closest_distance'] <- new_col
new_col = c()
for (i in 1 : nrow(df)){
  
  new_col = c(new_col, closest(as.numeric(df[['latitude']][i]), as.numeric(df[['longitude']][i])))
  
}

df['closest_station'] <- new_col
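The two row-by-row loops above compute every house-station distance twice. As a sketch of a vectorized alternative (shown on toy stand-ins for df and station_df so it runs standalone; measure_distance vectorizes naturally since it only uses arithmetic and trig), outer() builds the distance matrix once and both new columns fall out of it:

```r
to_rad <- function(x) x * pi / 180

measure_distance <- function(phi1, phi2, l1, l2) {
  phi1 <- to_rad(phi1); phi2 <- to_rad(phi2)
  delta <- to_rad(abs(l1 - l2))
  theta <- sin(phi1) * sin(phi2) + cos(phi1) * cos(phi2) * cos(delta)
  acos(pmin(pmax(theta, -1), 1)) * 6378000
}

#toy stand-ins; station coordinates copied from the table above
houses   <- data.frame(latitude = c(45.78, 45.75), longitude = c(4.88, 4.83))
stations <- data.frame(name = c("Cusset", "Foch"),
                       lat  = c(45.7656434, 45.7687864),
                       long = c(4.9008478, 4.8443036))

#rows = houses, columns = stations: every distance computed exactly once
dist_mat <- outer(seq_len(nrow(houses)), seq_len(nrow(stations)),
                  function(i, j) measure_distance(houses$latitude[i],
                                                  stations$lat[j],
                                                  houses$longitude[i],
                                                  stations$long[j]))

idx <- apply(dist_mat, 1, which.min)    #index of the nearest station per house
houses$closest_distance <- dist_mat[cbind(seq_len(nrow(houses)), idx)]
houses$closest_station  <- stations$name[idx]
```

With the real df and station_df, the same pattern replaces both loops in one pass.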

Let’s draw a chart:

hist(df$closest_distance, main='Distance to Nearest Station', xlab = 'distance', col = 'orchid')

Now we draw the correlation matrix again.

num_cols =unlist(lapply(df, is.numeric))  
cor_df = df[,num_cols]

cor_df = cor_df[,-which(names(cor_df) %in% c("latitude","longitude"))]


res <-cor(cor_df, use = "pairwise.complete.obs")

corrplot(res, type = "upper", order = "hclust", 
         tl.col = "black", tl.srt = 45)

As we can see, our new closest-distance column is not correlated with any of the other columns.

Part 2 : Inference & Visualization

1: correlation between rooms_count and surface_housing

Earlier we saw a relation between the number of rooms and area. Here I encode room count as dummy variables and see whether the regression improves. To do this we must create a new data frame; since room count ranges from 1 to 6, we use 5 indicator columns, with 6 rooms (all indicators zero) as the reference category.

We’ll also do the same for district: if columns dist1 to dist9 are all zero, then the district is “Villeurbanne”.
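The hand-rolled indicator loop used below reportedly takes ~30 minutes; base R’s model.matrix builds the same kind of 0/1 columns in one vectorized call. A minimal sketch on toy data, treating 6 rooms as the all-zero reference level:

```r
#rooms as a factor whose first (reference) level is 6, so a 6-room
#observation gets all-zero indicators, matching the scheme above
rooms <- factor(c(1, 2, 6, 3), levels = c(6, 1, 2, 3, 4, 5))

mm <- model.matrix(~ rooms)[, -1]   #drop the intercept column
colnames(mm)                        #"rooms1" ... "rooms5"
mm[3, ]                             #the 6-room row: all zeros
```

Districts can be dummy-coded the same way, with “Villeurbanne” as the reference level.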

y = df$rooms_count
x = df$surface_housing

# Change point shape (pch = 19) and remove frame.
plot(x, y, main = "room count vs area",
     xlab = "area", ylab = "number of rooms",
     pch = 19, frame = FALSE)
abline(lm(y ~ x, data = df), col = "purple")
lines(lowess(x, y), col = "blue")

scatterplot(rooms_count ~ surface_housing, data = df, col = 'purple')

#import data
tosave = data.frame(room1 = c(0),room2 = c(0),room3 = c(0), room4= c(0),room5 = c(0),dist1 = c(0),dist2= c(0),dist3 = c(0),dist4 = c(0),dist5 = c(0),dist6= c(0),dist7= c(0),dist8 = c(0),dist9 = c(0),area= c(0))


room_vec <- function(n) {
  if (n > 5){
    return(c(rep(0,5)))
  }
    vec = c(rep(0,5))
    vec[n] <- 1
    return (vec)
}




which_dist <- function(dist) {
    vec = c(rep(0,9))
    if (dist == "Villeurbanne"){
      return(vec)
    }
    if (dist == "Lyon 1er Arrondissement"){
      vec[1] <- 1
    }
    if (dist == "Lyon 2e Arrondissement"){
      vec[2] <- 1
    }
    if (dist == "Lyon 3e Arrondissement"){
      vec[3] <- 1
    }
    if (dist == "Lyon 4e Arrondissement"){
      vec[4] <- 1
    }
    if (dist == "Lyon 5e Arrondissement"){
      vec[5] <- 1
    }
    if (dist == "Lyon 6e Arrondissement"){
      vec[6] <- 1
    }
    if (dist == "Lyon 7e Arrondissement"){
      vec[7] <- 1
    }
    if (dist == "Lyon 8e Arrondissement"){
      vec[8] <- 1
    }
    if (dist == "Lyon 9e Arrondissement"){
      vec[9] <- 1
    }
    
    return (vec)
}


#######copy: the damn loop takes 30 minutes to complete!
# 
#   for (row in rownames(df)) {
#     print(row)
#           tosave[nrow(tosave) + 1,] = c(room_vec(as.numeric(df[row,4])),which_dist(as.character(df[row,10])),as.numeric(df[row,5]))
#   }
#  
# ####### removing the first row which was 0 to initialize the data frame
#  tosave <- tosave[-1,]
# # 
# 
# #### because the loop takes too long to complete I ran it once and saved the result to read from it later.
# write.csv(tosave,"roomdummy.csv", row.names = FALSE)


dummydf <- read.csv(file = 'roomdummy.csv')

dummydf['total_price'] = df['price']
dummydf['price_meter'] = df['price_per_meter']

head(dummydf)
room1 room2 room3 room4 room5 dist1 dist2 dist3 dist4 dist5 dist6 dist7 dist8 dist9 area total_price price_meter
0 0 0 0 1 0 0 0 0 0 0 0 0 0 100 530000 5300.000
0 1 0 0 0 0 0 0 0 0 0 0 0 0 52 328550 6318.269
1 0 0 0 0 0 0 0 0 0 0 0 0 0 28 42500 1517.857
0 0 1 0 0 0 0 0 0 0 0 0 0 0 67 180900 2700.000
1 0 0 0 0 0 0 0 0 0 0 0 0 0 28 97000 3464.286
0 1 0 0 0 0 0 0 0 0 0 0 0 0 42 112000 2666.667
tail(dummydf)
room1 room2 room3 room4 room5 dist1 dist2 dist3 dist4 dist5 dist6 dist7 dist8 dist9 area total_price price_meter
40368 0 1 0 0 0 0 0 0 0 0 0 0 0 1 25 159000 6360.000
40369 0 1 0 0 0 0 0 0 0 0 0 0 0 1 34 226000 6647.059
40370 0 1 0 0 0 0 0 0 0 0 0 0 0 1 33 217300 6584.848
40371 0 1 0 0 0 0 0 0 0 0 0 0 0 1 23 145000 6304.348
40372 0 1 0 0 0 0 0 0 0 0 0 0 0 1 34 206000 6058.824
40373 0 1 0 0 0 0 0 0 0 0 0 0 0 1 30 187500 6250.000

Now we run the regression on both the dummy-variable data and the original data.

reg2 = lm(surface_housing~rooms_count , df)
reg = lm(area~room1+room2+room3+room4+room5 , dummydf)

print("regression for original data:")
## [1] "regression for original data:"
summary(reg2)
## 
## Call:
## lm(formula = surface_housing ~ rooms_count, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -64.446  -8.397  -2.300   5.652 230.603 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.25111    0.19661   47.05   <2e-16 ***
## rooms_count 20.04868    0.06487  309.07   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 40371 degrees of freedom
## Multiple R-squared:  0.7029, Adjusted R-squared:  0.7029 
## F-statistic: 9.553e+04 on 1 and 40371 DF,  p-value: < 2.2e-16
print("regression for dummy variable version:")
## [1] "regression for dummy variable version:"
summary(reg)
## 
## Call:
## lm(formula = area ~ room1 + room2 + room3 + room4 + room5, data = dummydf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -76.708  -8.290  -1.912   5.637 231.088 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  146.7076     0.6300  232.86   <2e-16 ***
## room1       -114.4171     0.6596 -173.45   <2e-16 ***
## room2        -98.3446     0.6465 -152.13   <2e-16 ***
## room3        -77.7952     0.6443 -120.74   <2e-16 ***
## room4        -59.5074     0.6531  -91.12   <2e-16 ***
## room5        -34.6647     0.6986  -49.62   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.15 on 40367 degrees of freedom
## Multiple R-squared:  0.712,  Adjusted R-squared:  0.712 
## F-statistic: 1.996e+04 on 5 and 40367 DF,  p-value: < 2.2e-16

The \(R^2\) is large and the p-value is tiny, and neither differs much between the two versions. That is likely because room count is genuinely numeric: as it increases, we expect area to increase as well. We can therefore conclude that room count and area are strongly related.

Finding relations between columns matters in general. One reason is visualization: each variable adds a dimension, and we can’t plot more than 3 dimensions at once. Finding a relation between two columns and then dropping one of them, or replacing both with a single value that represents them (using PCA or t-SNE, for example), reduces the dimensionality and makes visualization easier.
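As a sketch of the PCA idea just mentioned, using base R’s prcomp on two correlated toy columns standing in for surface and room count (not the real data):

```r
set.seed(42)

#two strongly related toy columns
surface <- rnorm(200, mean = 70, sd = 20)
rooms   <- round(surface / 20) + rnorm(200, sd = 0.5)

pc <- prcomp(cbind(surface, rooms), scale. = TRUE)

#share of the total variance captured by the first component
summary(pc)$importance["Proportion of Variance", 1]

#the pair can then be replaced by the single score pc$x[, 1]
```

Because the two columns are strongly correlated, the first principal component captures most of the shared variance, so one score can stand in for both.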

2: Relation between price and district: does it exist?

Let’s draw the price map again, but this time using price per meter so that house size doesn’t affect our analysis.

First we draw a map to make distinction between districts visible.

ggplot() +geom_point( data=df, aes(y=latitude, x=longitude, color = as.factor(district)), size = 0.4)

Now draw price gradient for both total price and price per meter

print("price per meter")
## [1] "price per meter"
ggmap(map)+
geom_point(
    data = df,
    mapping = aes(x = longitude, y = latitude, color = price_per_meter, shape = type_property)
    , size = 1.5
  )+ scale_color_gradientn(colours=c("black","red"))+scale_shape_manual(values = c(4,15))

print("price")
## [1] "price"
ggmap(map)+
geom_point(
    data = df,
    mapping = aes(x = longitude, y = latitude, color = price, shape = type_property)
    , size = 1.5
  )+ scale_color_gradientn(colours=c("black","red"))+scale_shape_manual(values = c(4,15))

In the map from the first part, prices seemed higher in some areas than others. Comparing the two maps now, however, the difference might be because houses in some districts are larger rather than more expensive per meter. To check this, we fit a regression model.

model <- lm(price_per_meter ~ district, data = df)
summary(model)
## 
## Call:
## lm(formula = price_per_meter ~ district, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4106.2  -847.0   -98.4   759.6 31387.0 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     4653.86      29.46 157.992   <2e-16 ***
## districtLyon 2e Arrondissement   369.98      44.19   8.373   <2e-16 ***
## districtLyon 3e Arrondissement  -489.72      33.65 -14.554   <2e-16 ***
## districtLyon 4e Arrondissement   -16.53      39.52  -0.418    0.676    
## districtLyon 5e Arrondissement -1073.39      37.49 -28.628   <2e-16 ***
## districtLyon 6e Arrondissement   452.34      38.88  11.635   <2e-16 ***
## districtLyon 7e Arrondissement  -394.44      34.17 -11.544   <2e-16 ***
## districtLyon 8e Arrondissement  -874.57      34.79 -25.136   <2e-16 ***
## districtLyon 9e Arrondissement -1185.26      36.67 -32.323   <2e-16 ***
## districtVilleurbanne           -1290.89      31.95 -40.401   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1256 on 40363 degrees of freedom
## Multiple R-squared:  0.1646, Adjusted R-squared:  0.1644 
## F-statistic: 883.4 on 9 and 40363 DF,  p-value: < 2.2e-16

The \(R^2\) is very low: district alone explains only about 16% of the variation in price per meter. So despite the significant coefficients, and contrary to what we might think, district is a weak predictor of price.

Here is a bar plot of each district’s mean price/meter

dists = unique(df[c('district')])

mean_dist_price = c()

for (i in (1:nrow(dists))){
  word = as.character(dists[i,1])
  filtered <- df[as.character(df$district) == dists[i,1],]
  
  mean_dist_price <- c(mean_dist_price, mean(filtered$price_per_meter))
}
  

par(mar=c(11,4,4,4))

barplot(mean_dist_price,
        main = "district mean prices",
        ylab = "mean price",
        names.arg = dists$district, cex.names  = 1,las =2,
        col = "hotpink")

However, the correct way to regress price on district is to encode the districts as dummy variables. Let’s do this:

reg = lm(price_meter~dist1+dist2+dist3+dist4+dist5+dist6+dist7+dist8+dist9 , dummydf)
print("regression for dummy variable version of price/meter vs district:")
## [1] "regression for dummy variable version of price/meter vs district:"
summary(reg)
## 
## Call:
## lm(formula = price_meter ~ dist1 + dist2 + dist3 + dist4 + dist5 + 
##     dist6 + dist7 + dist8 + dist9, data = dummydf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4106.2  -847.0   -98.4   759.6 31387.0 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3362.98      12.38 271.674  < 2e-16 ***
## dist1        1290.89      31.95  40.401  < 2e-16 ***
## dist2        1660.87      35.18  47.204  < 2e-16 ***
## dist3         801.17      20.44  39.199  < 2e-16 ***
## dist4        1274.35      29.11  43.772  < 2e-16 ***
## dist5         217.50      26.29   8.272  < 2e-16 ***
## dist6        1743.23      28.23  61.742  < 2e-16 ***
## dist7         896.45      21.29  42.116  < 2e-16 ***
## dist8         416.31      22.28  18.689  < 2e-16 ***
## dist9         105.62      25.10   4.208 2.59e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1256 on 40363 degrees of freedom
## Multiple R-squared:  0.1646, Adjusted R-squared:  0.1644 
## F-statistic: 883.4 on 9 and 40363 DF,  p-value: < 2.2e-16

The R-squared is very low: district accounts for only about 16% of the variation in price. Interestingly, dist6 has the highest coefficient, and we saw earlier that district 6 also had the highest mean; likewise dist9 has the smallest coefficient, matching the shortest bar in the chart.

But here the goal is to compare districts, not to predict price or model it as a function of other variables. Price may not depend mostly on district, yet there can still be meaningful differences between the districts’ average prices. Besides, a regression built on dummy variables alone isn’t the best way to answer that question. For a more direct analysis, we therefore compare districts two by two with t-tests to find out which pairs differ meaningfully. Looking at the bar chart, districts 2 and 6, for example, have approximately the same average, while districts 5 and 6 look clearly different.

In each t-test, the null hypothesis is that the difference in means is 0 (the two districts have the same average). The alternative is that the means differ. With a threshold of 0.05, we fail to reject the null hypothesis when the p-value > 0.05, meaning that at the 95% confidence level there is no meaningful difference between the prices of those two districts.
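Base R also offers pairwise.t.test, which runs every pairwise comparison in one call; a sketch on toy districts (with the real columns it would be pairwise.t.test(df$price_per_meter, df$district, pool.sd = FALSE)):

```r
set.seed(7)

#toy districts: A and B share a mean, C is clearly pricier
price    <- c(rnorm(50, 4000, 500), rnorm(50, 4000, 500), rnorm(50, 5500, 500))
district <- rep(c("A", "B", "C"), each = 50)

#pool.sd = FALSE gives Welch-style tests, like t.test() above;
#p-values are adjusted for multiple comparisons (Holm by default)
pw <- pairwise.t.test(price, district, pool.sd = FALSE)
pw$p.value
```

One advantage over the manual loop is the built-in multiple-comparison adjustment, which matters when running 45 tests at once.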

#initialize frame
  word = as.character(dists[1,1])
  major = df[df$"district" == word, c("price_per_meter")]
  minor = df[df$"district" == as.character(dists[1,1]), c("price_per_meter")]
  res = t.test(major,minor)
  temp = c(word,as.character(dists[1,1]),tidy(res))
  tdf = data.frame(temp)
  tdf = tdf[-1,]
  
#iterate through districts and compare each with other ones.
for (i in (1:nrow(dists))){
  if (i == 10){
    break
  }
  word = as.character(dists[i,1])
  major = df[df$"district" == word, c("price_per_meter")]
  
  
  for (j in ((i+1):nrow(dists))){
    minor = df[df$"district" == as.character(dists[j,1]), c("price_per_meter")]
    res = t.test(major,minor)
    temp = c(word,as.character(dists[j,1]),tidy(res))
    tdf[nrow(tdf)+1,] = temp
  }
  

}
  

head(tdf, n = nrow(tdf))
X.Villeurbanne. X.Villeurbanne..1 estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high method alternative
Villeurbanne Lyon 1er Arrondissement -1290.88508 3362.978 4653.863 -35.3275890 0.0000000 2230.533 -1362.54187 -1219.22828 Welch Two Sample t-test two.sided
Villeurbanne Lyon 2e Arrondissement -1660.86812 3362.978 5023.846 -41.0054642 0.0000000 1715.569 -1740.30972 -1581.42651 Welch Two Sample t-test two.sided
Villeurbanne Lyon 3e Arrondissement -801.16910 3362.978 4164.147 -41.7518328 0.0000000 12218.838 -838.78226 -763.55595 Welch Two Sample t-test two.sided
Villeurbanne Lyon 4e Arrondissement -1274.35187 3362.978 4637.330 -40.1260694 0.0000000 2986.459 -1336.62302 -1212.08072 Welch Two Sample t-test two.sided
Villeurbanne Lyon 5e Arrondissement -217.49561 3362.978 3580.473 -7.7286288 0.0000000 4163.302 -272.66808 -162.32314 Welch Two Sample t-test two.sided
Villeurbanne Lyon 6e Arrondissement -1743.22973 3362.978 5106.208 -59.7422081 0.0000000 3395.631 -1800.44030 -1686.01916 Welch Two Sample t-test two.sided
Villeurbanne Lyon 7e Arrondissement -896.44591 3362.978 4259.424 -44.1516309 0.0000000 10180.982 -936.24536 -856.64646 Welch Two Sample t-test two.sided
Villeurbanne Lyon 8e Arrondissement -416.31272 3362.978 3779.291 -18.8128277 0.0000000 8094.375 -459.69163 -372.93381 Welch Two Sample t-test two.sided
Villeurbanne Lyon 9e Arrondissement -105.62467 3362.978 3468.603 -4.6313766 0.0000037 5703.519 -150.33373 -60.91561 Welch Two Sample t-test two.sided
Lyon 1er Arrondissement Lyon 2e Arrondissement -369.98304 4653.863 5023.846 -7.1023710 0.0000000 3114.119 -472.12293 -267.84315 Welch Two Sample t-test two.sided
Lyon 1er Arrondissement Lyon 3e Arrondissement 489.71597 4653.863 4164.147 12.8991454 0.0000000 2574.249 415.27095 564.16099 Welch Two Sample t-test two.sided
Lyon 1er Arrondissement Lyon 4e Arrondissement 16.53321 4653.863 4637.330 0.3623626 0.7171011 3812.961 -72.92077 105.98719 Welch Two Sample t-test two.sided
Lyon 1er Arrondissement Lyon 5e Arrondissement 1073.38947 4653.863 3580.473 24.8547293 0.0000000 3674.873 988.71754 1158.06139 Welch Two Sample t-test two.sided
Lyon 1er Arrondissement Lyon 6e Arrondissement -452.34466 4653.863 5106.208 -10.3110838 0.0000000 3669.387 -538.35616 -366.33316 Welch Two Sample t-test two.sided
Lyon 1er Arrondissement Lyon 7e Arrondissement 394.43917 4653.863 4259.424 10.2343952 0.0000000 2715.109 318.86740 470.01093 Welch Two Sample t-test two.sided
Lyon 1er Arrondissement Lyon 8e Arrondissement 874.57236 4653.863 3779.291 22.1227853 0.0000000 2958.569 797.05807 952.08664 Welch Two Sample t-test two.sided
Lyon 1er Arrondissement Lyon 9e Arrondissement 1185.26041 4653.863 3468.603 29.6941417 0.0000000 3009.970 1106.99574 1263.52508 Welch Two Sample t-test two.sided
Lyon 2e Arrondissement Lyon 3e Arrondissement 859.69901 5023.846 4164.147 20.5702694 0.0000000 1935.160 777.73442 941.66360 Welch Two Sample t-test two.sided
Lyon 2e Arrondissement Lyon 4e Arrondissement 386.51625 5023.846 4637.330 7.9110494 0.0000000 2989.438 290.71799 482.31451 Welch Two Sample t-test two.sided
Lyon 2e Arrondissement Lyon 5e Arrondissement 1443.37251 5023.846 3580.473 30.9818374 0.0000000 2744.996 1352.02201 1534.72300 Welch Two Sample t-test two.sided
Lyon 2e Arrondissement Lyon 6e Arrondissement -82.36162 5023.846 5106.208 -1.7441464 0.0812435 2795.149 -174.95461 10.23138 Welch Two Sample t-test two.sided
Lyon 2e Arrondissement Lyon 7e Arrondissement 764.42220 5023.846 4259.424 18.0642751 0.0000000 2026.764 681.43325 847.41116 Welch Two Sample t-test two.sided
Lyon 2e Arrondissement Lyon 8e Arrondissement 1244.55540 5023.846 3779.291 28.7942832 0.0000000 2188.004 1159.79434 1329.31645 Welch Two Sample t-test two.sided
Lyon 2e Arrondissement Lyon 9e Arrondissement 1555.24345 5023.846 3468.603 35.6929305 0.0000000 2234.908 1469.79592 1640.69098 Welch Two Sample t-test two.sided
Lyon 3e Arrondissement Lyon 4e Arrondissement -473.18277 4164.147 4637.330 -14.1722739 0.0000000 3566.799 -538.64410 -407.72143 Welch Two Sample t-test two.sided
Lyon 3e Arrondissement Lyon 5e Arrondissement 583.67349 4164.147 3580.473 19.4765139 0.0000000 5091.694 524.92319 642.42379 Welch Two Sample t-test two.sided
Lyon 3e Arrondissement Lyon 6e Arrondissement -942.06063 4164.147 5106.208 -30.4435296 0.0000000 4144.323 -1002.72851 -881.39275 Welch Two Sample t-test two.sided
Lyon 3e Arrondissement Lyon 7e Arrondissement -95.27681 4164.147 4259.424 -4.1846779 0.0000288 10988.826 -139.90621 -50.64740 Welch Two Sample t-test two.sided
Lyon 3e Arrondissement Lyon 8e Arrondissement 384.85638 4164.147 3779.291 15.7664562 0.0000000 9495.696 337.00791 432.70485 Welch Two Sample t-test two.sided
Lyon 3e Arrondissement Lyon 9e Arrondissement 695.54444 4164.147 3468.603 27.7936313 0.0000000 7105.987 646.48736 744.60151 Welch Two Sample t-test two.sided
Lyon 4e Arrondissement Lyon 5e Arrondissement 1056.85626 4637.330 3580.473 26.9444149 0.0000000 4850.805 979.96028 1133.75224 Welch Two Sample t-test two.sided
Lyon 4e Arrondissement Lyon 6e Arrondissement -468.87786 4637.330 5106.208 -11.7293992 0.0000000 4636.048 -547.24707 -390.50866 Welch Two Sample t-test two.sided
Lyon 4e Arrondissement Lyon 7e Arrondissement 377.90596 4637.330 4259.424 11.1015109 0.0000000 3793.211 311.16563 444.64629 Welch Two Sample t-test two.sided
Lyon 4e Arrondissement Lyon 8e Arrondissement 858.03915 4637.330 3779.291 24.4036714 0.0000000 4165.080 789.10630 926.97200 Welch Two Sample t-test two.sided
Lyon 4e Arrondissement Lyon 9e Arrondissement 1168.72720 4637.330 3468.603 32.8383540 0.0000000 4170.256 1098.95122 1238.50318 Welch Two Sample t-test two.sided
Lyon 5e Arrondissement Lyon 6e Arrondissement -1525.73412 3580.473 5106.208 -41.0520615 0.0000000 5288.241 -1598.59450 -1452.87375 Welch Two Sample t-test two.sided
Lyon 5e Arrondissement Lyon 7e Arrondissement -678.95030 3580.473 4259.424 -22.1200112 0.0000000 5412.017 -739.12278 -618.77783 Welch Two Sample t-test two.sided
Lyon 5e Arrondissement Lyon 8e Arrondissement -198.81711 3580.473 3779.291 -6.2265029 0.0000000 5875.164 -261.41319 -136.22103 Welch Two Sample t-test two.sided
Lyon 5e Arrondissement Lyon 9e Arrondissement 111.87094 3580.473 3468.603 3.4524090 0.0005597 5661.324 48.34721 175.39468 Welch Two Sample t-test two.sided
Lyon 6e Arrondissement Lyon 7e Arrondissement 846.78382 5106.208 4259.424 26.7563043 0.0000000 4422.370 784.73788 908.82977 Welch Two Sample t-test two.sided
Lyon 6e Arrondissement Lyon 8e Arrondissement 1326.91701 5106.208 3779.291 40.3945026 0.0000000 4855.943 1262.51820 1391.31583 Welch Two Sample t-test two.sided
Lyon 6e Arrondissement Lyon 9e Arrondissement 1637.60507 5106.208 3468.603 49.1642374 0.0000000 4778.781 1572.30435 1702.90579 Welch Two Sample t-test two.sided
Lyon 7e Arrondissement Lyon 8e Arrondissement 480.13319 4259.424 3779.291 18.9808252 0.0000000 9525.229 430.54824 529.71815 Welch Two Sample t-test two.sided
Lyon 7e Arrondissement Lyon 9e Arrondissement 790.82124 4259.424 3468.603 30.5452669 0.0000000 7390.948 740.06919 841.57330 Welch Two Sample t-test two.sided
Lyon 8e Arrondissement Lyon 9e Arrondissement 310.68805 3779.291 3468.603 11.3617314 0.0000000 7584.151 257.08402 364.29208 Welch Two Sample t-test two.sided

The district pairs with no statistically significant difference in price are:

tdf[which(tdf$'p.value'>0.05),]
##            X.Villeurbanne.      X.Villeurbanne..1  estimate estimate1 estimate2
## 12 Lyon 1er Arrondissement Lyon 4e Arrondissement  16.53321  4653.863  4637.330
## 21  Lyon 2e Arrondissement Lyon 6e Arrondissement -82.36162  5023.846  5106.208
##     statistic   p.value parameter   conf.low conf.high                  method
## 12  0.3623626 0.7171011  3812.961  -72.92077 105.98719 Welch Two Sample t-test
## 21 -1.7441464 0.0812435  2795.149 -174.95461  10.23138 Welch Two Sample t-test
##    alternative
## 12   two.sided
## 21   two.sided

This result indicates that all districts have different mean prices except for districts 1 and 4, and districts 2 and 6, whose averages are statistically indistinguishable.

Looking at the histogram, this result might seem odd: districts 5 and 9 also appear to have the same bar height, yet the t-test says otherwise. The bars are not adjacent, and comparing them by eye is unreliable. We can also look at the averages directly:

print(mean_dist_price)
##  [1] 3362.978 4653.863 5023.846 4164.147 4637.330 3580.473 5106.208 4259.424
##  [9] 3779.291 3468.603

Districts 1 and 4 have means of 4653 and 4637 respectively, a difference of only 16 (about 0.3%), and districts 2 and 6 have means of 5023 and 5106, a difference of 83 (about 1.6%); these are exactly the two pairs the t-test could not distinguish. Districts 5 and 9, however, have means of 3580 and 3468, a difference of 112 (about 3%), large enough relative to the sampling variability that the t-test concludes their prices differ.
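For reference, a table like the one above can be produced by looping over every pair of districts; a minimal sketch, assuming `df` holds `district` and `price_per_meter` columns, using `broom::tidy()` to flatten each test into one row (the same columns shown above):

```r
library(broom)

districts <- unique(as.character(df$district))
pairs <- combn(districts, 2)  # 2 x 45 matrix: every unordered pair of districts

tdf <- do.call(rbind, lapply(seq_len(ncol(pairs)), function(k) {
  a <- df$price_per_meter[df$district == pairs[1, k]]
  b <- df$price_per_meter[df$district == pairs[2, k]]
  cbind(district1 = pairs[1, k], district2 = pairs[2, k], tidy(t.test(a, b)))
}))
```

With 45 simultaneous tests, it would also be prudent to adjust the p-values, e.g. `p.adjust(tdf$p.value, method = "holm")`, before reading off significance.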

3: Is the price distribution normal?

In this part we check whether the prices in our data set come from a normal distribution.

Note that for this part we’ll use price per meter values to prevent size of house to interfere with our analysis.

We start by displaying a random sample of 10 rows using the function sample_n()[in dplyr package].

There are several normality tests, such as the Kolmogorov-Smirnov (K-S) test and the Shapiro-Wilk test.

The null hypothesis of these tests is that the sample comes from a normal distribution. If the test is significant, the distribution is non-normal.

Shapiro-Wilk's method is widely recommended for normality testing, as it provides better power than K-S. It is based on the correlation between the data and the corresponding normal scores.

The R function shapiro.test() can be used to perform the Shapiro-Wilk test of normality for one variable (univariate):

set.seed(13)

#Show 10 random rows:
dplyr::sample_n(df, 10)
##    date_transaction type_purchase type_property rooms_count surface_housing
## 1        0017-12-11        ancien   appartement           2              46
## 2        0020-04-09        ancien   appartement           2              62
## 3        0019-05-13        ancien   appartement           4              67
## 4        0017-03-10        ancien   appartement           4              64
## 5        0017-05-10          VEFA   appartement           2              34
## 6        0020-03-12          VEFA   appartement           4              79
## 7        0021-04-30        ancien   appartement           2              53
## 8        0018-08-31        ancien   appartement           5             110
## 9        0019-05-31        ancien   appartement           5             115
## 10       0018-12-28        ancien   appartement           1              20
##    surface_terrain parkings_count  price                      address
## 1                0              1 170000 9 RUE PROFESSEUR PAUL SISLEY
## 2                0              1 218000         63 RUE PAUL VERLAINE
## 3                0              0 164000       173 RUE ANATOLE FRANCE
## 4                0              0 150000         12 AV CDT LHERMINIER
## 5                0              1 153100  16 RUE DES DOCTEURS CORDIER
## 6                0              1 424860   140 RUE COMMANDANT CHARCOT
## 7                0              0 165000              3 RUE DU MARCHE
## 8                0              1 361700              3 RUE ST ROMAIN
## 9                0              1 215000         82 RUE JEAN SARRAZIN
## 10               0              0  74370           131 AV JEAN JAURES
##                  district latitude longitude date_construction price_per_meter
## 1  Lyon 3e Arrondissement 45.75182  4.869549        0096-08-29        3695.652
## 2            Villeurbanne 45.76419  4.880895        0003-06-11        3516.129
## 3            Villeurbanne 45.77038  4.886646        0003-06-11        2447.761
## 4            Villeurbanne 45.77009  4.883356        0003-06-11        2343.750
## 5  Lyon 9e Arrondissement 45.80075  4.832650        0090-01-01        4502.941
## 6  Lyon 5e Arrondissement 45.74706  4.788048        0090-01-01        5377.975
## 7  Lyon 9e Arrondissement 45.77423  4.807017        0090-01-01        3113.208
## 8  Lyon 8e Arrondissement 45.74261  4.861444        0090-01-01        3288.182
## 9  Lyon 8e Arrondissement 45.73115  4.863194        0090-01-01        1869.565
## 10 Lyon 7e Arrondissement 45.74133  4.839913        0093-04-03        3718.500
##    closest_distance                closest_station
## 1          99.26294         Dauphiné - Lacassagne
## 2         552.59206                    Gratte-Ciel
## 3         358.64530         Flachet - Alain Gilles
## 4         141.03288                    Gratte-Ciel
## 5        1717.61237                          Cuire
## 6        2511.09511 Hôtel de région - Montrochet
## 7         132.29564                          Valmy
## 8         326.75460     Jet d'Eau - Mendès France
## 9         316.86765                  Petite Guille
## 10        467.16217             Place Jean Jaurès

First, we remove outliers:

outliers <- boxplot(df$price_per_meter, plot=FALSE)$out

x<- df[-which(df$price_per_meter %in% outliers),]
plot(density(df$price_per_meter),main = "Density plot of price/meter (Original)",xlab = "Price/Meter")

plot(density(x$price_per_meter),main = "Density plot of price/meter (after removing outliers)",xlab = "Price/Meter")

#ggqqplot(df$price)

We also draw a histogram and fit a normal curve on it:

g = df$price_per_meter

h <- hist(g, breaks = 100,
          col = "mediumpurple3", xlab = "Price/meter", main = "Overall") 
xfit <- seq(min(g), max(g), length = 400) 
yfit <- dnorm(xfit, mean = mean(g), sd = sd(g)) 
yfit <- yfit * diff(h$mids[1:2]) * length(g) 

lines(xfit, yfit, col = "red", lwd = 2)

Now we do the normality test:

set.seed(2)

#Show 10 random rows:
sample <- dplyr::sample_n(x, 100)
shapiro.test(sample$price_per_meter)
## 
##  Shapiro-Wilk normality test
## 
## data:  sample$price_per_meter
## W = 0.97677, p-value = 0.07433

With a threshold of 0.05 for the p-value, if \(p_{value} > 0.05\) we fail to reject the null hypothesis at the \(95\%\) confidence level. Since the p-value (0.074) is above \(0.05\), we fail to reject normality: the data are consistent with a normally distributed price. (Strictly speaking, failing to reject does not prove normality; it only means the test found no evidence against it.)
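As a visual complement to the Shapiro-Wilk test (the commented-out `ggqqplot()` call above hints at this), a base-R quantile-quantile plot of the same sample:

```r
qqnorm(sample$price_per_meter, main = "Q-Q plot of price/meter (sample)")
qqline(sample$price_per_meter, col = "red", lwd = 2)  # points close to the line suggest normality
```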

4: Is the area distribution normal?

Finding out that price is mostly driven by surface_housing (in Part 1, where we drew the correlation matrix), together with the fact that price is approximately normal, made me wonder whether area also has a normal distribution. So we repeat everything for area.


First, we remove outliers:

outliers <- boxplot(df$surface_housing, plot=FALSE)$out

x<- df[-which(df$surface_housing%in% outliers),]
plot(density(df$surface_housing),main = "Density plot of area (Original)",xlab = "Area (m^2)")

plot(density(x$surface_housing),main = "Density plot of area (after removing outliers)",xlab = "Area (m^2)")

#ggqqplot(df$price)
set.seed(2)

#Show 10 random rows:
sample <- dplyr::sample_n(x, 100)
shapiro.test(sample$surface_housing)
## 
##  Shapiro-Wilk normality test
## 
## data:  sample$surface_housing
## W = 0.98349, p-value = 0.2461

Even though the plots may not suggest it, since \(p_{value} > 0.05\) we fail to reject the hypothesis that the area distribution is normal.

Why do parts 3 and 4 give useful results?

Knowing the distribution of a quantity makes it predictable and easier to estimate. For an important parameter like price, this has clear benefits. It can also help us spot prices that are unusually high or low.

(For example, if the price is really low you should suspect that the house is haunted, and you don't want to live in it and turn your life into a nightmare.)

5: Relation between price and surface housing

We expect larger houses to be more expensive; however, we do not expect the price per meter to change with the area of the house.

model = lm(formula = price ~ surface_housing, df)

summary(model)
## 
## Call:
## lm(formula = price ~ surface_housing, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -852973  -57388   -1768   50073 2463692 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -11056.18    1285.94  -8.598   <2e-16 ***
## surface_housing   4092.05      18.09 226.146   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 102600 on 40371 degrees of freedom
## Multiple R-squared:  0.5588, Adjusted R-squared:  0.5588 
## F-statistic: 5.114e+04 on 1 and 40371 DF,  p-value: < 2.2e-16
plot(df[['surface_housing']], df[['price']], main = paste('Housing Price'), xlab = 'Area', ylab = 'Price', col = 'mediumpurple')
abline(model,lwd= 4, col = 'lightslateblue') 

We can use the slope estimate and its standard error to calculate a confidence interval for the true slope (and not just the slope estimated from the sample).

\[I = [\mu - 1.96\sigma , \mu + 1.96\sigma] \] where \(\mu\) is the estimated slope, \(\sigma\) its standard error, and 1.96 the critical value of the standard normal distribution for a \(95\%\) confidence level. \[ \mu = 4092.05\] \[\sigma = 18.09 \] \[ I = [4056, 4127]\] Clearly \(0 \notin I\), so at the \(95\%\) confidence level area and price have a nonzero linear relation. (Well, who didn't already know this? :) )
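The same interval can be obtained directly from the fitted model with `confint()`; a quick sketch using the `model` object above:

```r
confint(model, "surface_housing", level = 0.95)  # roughly [4056, 4127], matching the hand computation
```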

However, doing exactly the same thing for price/meter and area:

model = lm(formula = price_per_meter ~ surface_housing, df)

summary(model)
## 
## Call:
## lm(formula = price_per_meter ~ surface_housing, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3111.7  -987.0  -119.4   870.1 30836.5 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     4196.6511    17.1766  244.32   <2e-16 ***
## surface_housing   -3.5399     0.2417  -14.65   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1371 on 40371 degrees of freedom
## Multiple R-squared:  0.005285,   Adjusted R-squared:  0.005261 
## F-statistic: 214.5 on 1 and 40371 DF,  p-value: < 2.2e-16
plot(df[['surface_housing']], df[['price_per_meter']], main = paste('Housing Price'), xlab = 'Area', ylab = 'Price/meter', col = 'mediumpurple')
abline(model,lwd= 4, col = 'lightslateblue') 

Looking at \(R^2\) and the plots, we instantly see that area explains about \(56\%\) of the variation in price, but has essentially nothing to do with price/meter: the fitted line is nearly horizontal and the R-squared is very close to 0.

6: Closer to station, more expensive?

One might think the closer a house is to a station, the more expensive it must be, because of the benefit of its location. Let’s check this out.

model = lm(formula = price_per_meter ~ closest_distance, df)

summary(model)
## 
## Call:
## lm(formula = price_per_meter ~ closest_distance, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3205.8  -949.7  -126.6   839.5 30844.9 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      4226.76134   10.17923  415.23   <2e-16 ***
## closest_distance   -0.54513    0.01593  -34.22   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1355 on 40371 degrees of freedom
## Multiple R-squared:  0.0282, Adjusted R-squared:  0.02817 
## F-statistic:  1171 on 1 and 40371 DF,  p-value: < 2.2e-16
plot(df[['closest_distance']], df[['price_per_meter']], main = paste('Housing Price vs distance to station'), xlab = 'distance to closest station', ylab = 'Price', col = 'mediumpurple')
abline(model,lwd= 4, col = 'lightslateblue') 

Despite the slight negative slope, this parameter does not seem to have much effect on price: here too the R-squared (about 0.028) is very close to 0.
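Another way to probe this is to split the data at an (arbitrarily chosen) 500 m cut-off and compare the two groups directly; a sketch:

```r
near <- df$price_per_meter[df$closest_distance <= 500]  # houses within 500 m of a station
far  <- df$price_per_meter[df$closest_distance > 500]
t.test(near, far)  # do mean prices/meter differ between the two groups?
```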

7: Price and everything else!

In the previous parts we analyzed the relation between price and other factors separately. Now we model price as a function of several parameters at once. We saw that area explains about 56% of the variation in price; district also has a modest effect.

#drop columns not used as predictors (including price per meter and rooms count)
dummydf3 = dummydf[,-c(1,2,3,4,5,17)]

model = lm(formula = price~., dummydf3)

summary(model)
## Warning in summary.lm(model): essentially perfect fit: summary may be unreliable
## 
## Call:
## lm(formula = price ~ ., data = dummydf3)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -2.284e-08 -1.500e-12  4.000e-13  2.700e-12  2.286e-10 
## 
## Coefficients:
##               Estimate Std. Error    t value Pr(>|t|)    
## (Intercept)  5.084e-11  1.708e-12  2.977e+01  < 2e-16 ***
## dist1       -8.478e-11  2.931e-12 -2.892e+01  < 2e-16 ***
## dist2       -1.239e-10  3.267e-12 -3.792e+01  < 2e-16 ***
## dist3       -6.268e-11  1.884e-12 -3.326e+01  < 2e-16 ***
## dist4       -9.278e-11  2.701e-12 -3.435e+01  < 2e-16 ***
## dist5       -8.858e-12  2.391e-12 -3.704e+00 0.000213 ***
## dist6       -1.455e-10  2.696e-12 -5.396e+01  < 2e-16 ***
## dist7       -5.695e-11  1.956e-12 -2.911e+01  < 2e-16 ***
## dist8       -2.828e-11  2.025e-12 -1.397e+01  < 2e-16 ***
## dist9       -8.159e-13  2.276e-12 -3.580e-01 0.720017    
## area        -4.669e-12  3.117e-14 -1.498e+02  < 2e-16 ***
## total_price  1.000e+00  5.985e-18  1.671e+17  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.139e-10 on 40361 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 6.757e+33 on 11 and 40361 DF,  p-value: < 2.2e-16
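Note the warning and the coefficient of exactly 1 on `total_price`: the response is effectively among the predictors, which makes this fit degenerate. To measure what district and area explain on their own, `total_price` should be excluded; a sketch, assuming `total_price` is a column of `dummydf3`:

```r
dummydf4 <- dummydf3[, names(dummydf3) != "total_price"]  # drop the leaked response
model2 <- lm(price ~ ., data = dummydf4)
summary(model2)$r.squared  # variance in price explained by district and area alone
```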

The R-squared is exactly 1, but this is an artifact rather than a finding: the coefficient on total_price is exactly 1 and the residuals are at numerical zero, so the response is effectively being regressed on itself (hence the "essentially perfect fit" warning). This fit tells us nothing about district and area on their own; total_price must be dropped before the model can be interpreted.

8: Different prices for different years

In the previous part we did not consider time to have an effect on price. In this part we compute the mean price/meter for each district, but this time separately for each year.

#create new df and add required data to it:

td = df[,c(10,14)]
td['year'] <- year(df$date_transaction)


dists = unique(td[c('district')])
years = unique(td[c('year')])


tdy = data.frame(district = c(0), year = c(0) , mean_price_meter = c(0))

for (i in (1:nrow(dists))){
  
  word = as.character(dists[i,1])
  
  filtered <- td[as.character(td$district) == word,]
  
  for (j in (1:nrow(years))){
  
    y = as.numeric(years[j,1])

    filtered2 <- filtered[as.numeric(filtered$year) == y,]
  
    mdtp <-  mean(filtered2$price_per_meter)
    
    if(!is.na(mdtp)){
      tdy[nrow(tdy)+1,] = c(word , y , mdtp)
    }
  }
}
  
#remove first row:
tdy = tdy[-1,]
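The nested loops above can be collapsed into a single grouped aggregation; an equivalent sketch (the formula interface drops empty district/year combinations, mirroring the `is.na()` check):

```r
tdy2 <- aggregate(price_per_meter ~ district + year, data = td, FUN = mean)
names(tdy2)[3] <- "mean_price_meter"
```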

# from 1990 to 2023

newyear = tdy$year

 for (j in (1:nrow(years))){
    y = as.numeric(years[j,1])
    if (-1 < y & y < 24){
    newyear[newyear == y] <- 2000+y
    }
    else{
      newyear[newyear == y] <- 1900+y
    }
 }

tdy['year'] = newyear
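The loop above can also be replaced by a vectorized helper; a non-destructive sketch, applied to a vector of two-digit years `yy`:

```r
full_year <- function(yy) ifelse(yy > -1 & yy < 24, 2000 + yy, 1900 + yy)
# e.g. full_year(c(16, 21, 99)) gives 2016, 2021, 1999
```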

#tdy = tdy[order(tdy$year),]

tdy$mean_price_meter =as.numeric(as.character(tdy$mean_price_meter))

tdy$mean_price_meter  = round(tdy$mean_price_meter )

tdy$year =as.numeric(as.character(tdy$year))

ggplot(tdy,aes(x = year, y = mean_price_meter, group = district))+geom_line(aes(color = district) )+geom_point(aes(color = district))

tdyear <- summarise_at(group_by(td, year), vars(price_per_meter), mean)


ggplot(tdyear , aes(x=year, y=price_per_meter)) + geom_line(color = "purple") +geom_point( color = "mediumorchid", size = 3) + ggtitle("Average housing prices/meter vs. Year")

model = lm(price_per_meter ~ year, tdyear)

summary(model)
## 
## Call:
## lm(formula = price_per_meter ~ year, data = tdyear)
## 
## Residuals:
##       1       2       3       4       5       6 
##   6.564  60.892 -49.080 -93.128  57.105  17.646 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -602.76     301.25  -2.001 0.116005    
## year          250.59      16.22  15.454 0.000102 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 67.83 on 4 degrees of freedom
## Multiple R-squared:  0.9835, Adjusted R-squared:  0.9794 
## F-statistic: 238.8 on 1 and 4 DF,  p-value: 0.0001023

The R-squared is 0.98, almost 1! The R-squared for district was 0.16 and for area 0.005 (recall that it's price per meter we're talking about). It seems the most important factor affecting price/meter is time rather than location.

The fact that prices grow with time reflects inflation, and analyzing price over time can be very useful for economic purposes.

9: Are maisons larger than apartments?

h <- df[, c('rooms_count', 'type_property')]

freq <- dcast(as.data.table(h), rooms_count ~ type_property)
## Using 'type_property' as value column. Use 'value.var' to override
## Aggregate function missing, defaulting to 'length'
freq <- freq[, -1]

dt <- as.table(as.matrix(freq))

balloonplot(t(dt), main = "rooms count by property type", xlab = "type_property", ylab = "rooms_count", label = TRUE, show.margins = FALSE)

boxplot(surface_housing ~ type_property , xlab = 'type of property',data=df, col = 'purple', main="Boxplot for type property vs surface")

The plot shows a higher average surface for maisons. The room-count comparison, however, is less obvious: most apartments have 3 rooms and most houses have 4, and the difference is too small to draw a conclusion from the plots alone. So we test our hypothesis with a t-test.

house = df[df$"type_property" == "maison", c("surface_housing")]
appart = df[df$"type_property" == "appartement", c("surface_housing")]
room_app = df[df$"type_property" == "appartement", c("rooms_count")]
room_house = df[df$"type_property" == "maison", c("rooms_count")]


res = t.test(house,appart)

res2 = t.test(room_house,room_app)

print("t-test result for surface housing")
## [1] "t-test result for surface housing"
res
## 
##  Welch Two Sample t-test
## 
## data:  house and appart
## t = 25.493, df = 682.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  36.61301 42.72343
## sample estimates:
## mean of x mean of y 
##  104.2288   64.5606
print("t-test result for rooms count")
## [1] "t-test result for rooms count"
res2
## 
##  Welch Two Sample t-test
## 
## data:  room_house and room_app
## t = 31.352, df = 695.15, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.327432 1.504795
## sample estimates:
## mean of x mean of y 
##  4.184250  2.768136

Thus at the \(95\%\) confidence level the averages are not equal. In fact, since \(p_{value} < 0.01\), the two groups have different means even at the 99% confidence level, and maisons are, on average, larger than apartments.
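The two-sided test shows the means differ; to test the directional claim that maisons are larger, the alternative hypothesis can be made explicit:

```r
t.test(house, appart, alternative = "greater")  # H1: mean maison surface > mean appartement surface
```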

Part 3 : Where to buy a house?

I would want a house with the following properties:

1. Close to university

2. Close to station

3. Room number of 3 or 4

4. Area between 50 to 100

5. The lower the price the better!

6. Type doesn’t matter

7. I don’t have a car so parking doesn’t matter

Thus the first step is to fix the location of the house. Since every district offers the full range of prices and areas, I first choose a district based on closeness to the university. Let's draw the map of districts again, this time also marking the university's location.

#coordinates of university
unilong = 4.8655
unilat = 45.7802

ggplot() +geom_point( data=df, aes(y=latitude, x=longitude, color = as.factor(district)), size = 0.4)+geom_circle(aes(x0 = unilong, y0 = unilat, r = 0.005))

The university is in Villeurbanne, and its location is not on the border of the district or particularly close to any other, so it would make little sense to buy a house anywhere else. District 6 is also fairly close, but recall Part 2.1, where we drew the histograms of prices: Villeurbanne was one of the cheapest districts while 6 was the most expensive! Here it is again:

dists = unique(df[c('district')])

mean_dist_price = c()

for (i in (1:nrow(dists))){
  word = as.character(dists[i,1])
  filtered <- df[as.character(df$district) == dists[i,1],]
  
  mean_dist_price <- c(mean_dist_price, mean(filtered$price_per_meter))
}
  

par(mar=c(11,4,4,4))

barplot(mean_dist_price,
        main = "district mean prices",
        ylab = "mean price",
        names.arg = dists$district, cex.names  = 1,las =2,
        col = "hotpink")

Thus I choose Villeurbanne over district 6 because of the prices.

To come up with further detail, we filter only houses in this district:

vil <- filter(df, district == "Villeurbanne")

#some coordinates are missing. I don't want those that don't have coordinates
vil <- na.omit(vil)

#reset index
rownames(vil) <- NULL

Then we design a formula to give each house a certain score based on properties we desire.

score_func <- function(i) {
  
    room_num = vil$rooms_count[i]
    
    if((room_num < 3) || (room_num > 5)){
      # I don't want a house that has less than 3 or more than 5 rooms
      return(0)
    }
    
    area = as.numeric(vil$surface_housing[i])
    if(area < 50){
      
      # I don't want a house that is smaller than 50 m^2
      return(0)
    }
    
    
    price = as.numeric(vil$price[i])
    
    mean_price = mean(vil$price)
    
    scaled_price = (price - mean_price)/mean_price
    
    d_station = as.numeric(vil$closest_distance[i])
    
    d_uni = measure_distance(as.numeric(vil$latitude[i]),as.numeric(unilat),as.numeric(vil$longitude[i]),as.numeric(unilong) )
    
    #the larger the price, the worse! It's the most important factor, thus its coefficient is 10
    #the smaller the distances, the better! Distance to the university matters more than distance to the station
    score = (-10*scaled_price)+(-5*d_uni)+(-3*d_station)
    
    return(score)
       
    
}
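`score_func()` relies on a `measure_distance()` helper defined earlier in the analysis (not shown in this chunk). For reference, a haversine sketch of what such a helper could look like, taking degrees and returning metres (argument order matching the call above; named `measure_distance_sketch` so as not to shadow the real one):

```r
measure_distance_sketch <- function(lat1, lat2, lon1, lon2) {
  R <- 6371000  # mean Earth radius in metres
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 + cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * R * asin(sqrt(a))  # great-circle distance
}
```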

s = c()
for (i in (1 : nrow(vil))){
  s = c(s , score_func(i))
}
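Equivalently, the growing-vector loop can be replaced with `sapply()`:

```r
s <- sapply(seq_len(nrow(vil)), score_func)  # one score per row of vil
```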

vil['score'] <- s


vil = vil[vil$score != 0, ]

#sort
vil <- vil[order(-vil$score),]


head(vil,20)
date_transaction type_purchase type_property rooms_count surface_housing surface_terrain parkings_count price address district latitude longitude date_construction price_per_meter closest_distance closest_station score
3743 0016-07-18 ancien appartement 3 79 0 0 150000 1 RUE DU TONKIN Villeurbanne 45.77831 4.865586 0003-06-11 1898.734 105.2488 Condorcet -1367.261
3751 0020-07-09 ancien appartement 4 94 0 3 218000 1 RUE DU TONKIN Villeurbanne 45.77831 4.865586 0003-06-11 2319.149 105.2488 Condorcet -1370.602
3745 0017-01-30 ancien appartement 4 94 0 0 223550 1 RUE DU TONKIN Villeurbanne 45.77831 4.865586 0003-06-11 2378.191 105.2488 Condorcet -1370.874
3750 0020-02-07 ancien appartement 4 94 0 3 240000 1 RUE DU TONKIN Villeurbanne 45.77831 4.865586 0003-06-11 2553.191 105.2488 Condorcet -1371.683
3753 0020-12-10 ancien appartement 3 76 0 1 245000 1 RUE DU TONKIN Villeurbanne 45.77831 4.865586 0003-06-11 3223.684 105.2488 Condorcet -1371.928
3748 0019-09-30 ancien appartement 4 94 0 0 299730 1 RUE DU TONKIN Villeurbanne 45.77831 4.865586 0003-06-11 3188.617 105.2488 Condorcet -1374.617
3854 0016-12-01 ancien appartement 4 75 0 0 104643 4 RUE DU TONKIN Villeurbanne 45.77861 4.864574 0018-05-23 1395.240 153.1833 Condorcet -1409.876
3858 0020-07-30 ancien appartement 4 75 0 0 124853 4 RUE DU TONKIN Villeurbanne 45.77861 4.864574 0018-05-23 1664.707 153.1833 Condorcet -1410.869
3852 0016-11-10 ancien appartement 4 75 0 1 133600 50 BD DU ONZE NOVEMBRE 1918 Villeurbanne 45.77861 4.864574 0018-05-23 1781.333 153.1833 Condorcet -1411.299
3857 0019-11-21 ancien appartement 3 67 0 2 135264 50 BD DU ONZE NOVEMBRE 1918 Villeurbanne 45.77861 4.864574 0018-05-23 2018.866 153.1833 Condorcet -1411.380
3853 0016-11-29 ancien appartement 5 97 0 0 135675 4 RUE DU TONKIN Villeurbanne 45.77861 4.864574 0018-05-23 1398.711 153.1833 Condorcet -1411.401
3855 0017-03-06 ancien appartement 4 87 0 1 141400 4 RUE DU TONKIN Villeurbanne 45.77861 4.864574 0018-05-23 1625.287 153.1833 Condorcet -1411.682
3718 0017-04-19 ancien appartement 3 65 0 1 137500 40 BD DU ONZE NOVEMBRE 1918 Villeurbanne 45.77871 4.863584 0003-06-11 2115.385 225.5238 Condorcet -1788.533
3719 0017-08-01 ancien appartement 3 65 0 1 156500 40 BD DU ONZE NOVEMBRE 1918 Villeurbanne 45.77871 4.863584 0003-06-11 2407.692 225.5238 Condorcet -1789.467
3717 0017-03-01 ancien appartement 3 65 0 1 168000 36 BD DU ONZE NOVEMBRE 1918 Villeurbanne 45.77871 4.863584 0003-06-11 2584.615 225.5238 Condorcet -1790.032
3727 0019-01-14 ancien appartement 4 74 0 1 195500 7 AV ROBERTO ROSSELLINI Villeurbanne 45.77871 4.863584 0003-06-11 2641.892 225.5238 Condorcet -1791.383
3730 0019-05-23 ancien appartement 3 65 0 1 215000 40 BD DU ONZE NOVEMBRE 1918 Villeurbanne 45.77871 4.863584 0003-06-11 3307.692 225.5238 Condorcet -1792.341
3645 0017-10-10 ancien appartement 3 68 0 1 145000 14 RUE DU TONKIN Villeurbanne 45.77795 4.864337 0003-06-11 2132.353 203.4334 Condorcet -1936.011
3643 0017-04-24 ancien appartement 4 81 0 1 160000 18 RUE DU TONKIN Villeurbanne 45.77795 4.864337 0003-06-11 1975.309 203.4334 Condorcet -1936.748
3652 0018-03-19 ancien appartement 4 81 0 1 172100 18 RUE DU TONKIN Villeurbanne 45.77795 4.864337 0003-06-11 2124.691 203.4334 Condorcet -1937.343

These are the houses with the highest scores, listed in descending order.

Below is the score map.

ggmap(map) +
  geom_point(
    data = vil,
    mapping = aes(x = longitude, y = latitude, color = score, shape = type_property),
    size = 1.5
  ) +
  scale_color_gradientn(colours = c("cyan", "darkorchid1")) +
  scale_shape_manual(values = c(4, 15))

# geom_circle() comes from the ggforce package; the circle marks a radius of
# roughly 0.007 degrees around the university.
ggplot() +
  geom_point(data = vil, aes(x = longitude, y = latitude, color = score), size = 1.5) +
  geom_circle(aes(x0 = unilong, y0 = unilat, r = 0.007)) +
  scale_color_gradientn(colours = c("cyan", "darkorchid1"))

This is the area where I would want to buy a house, and the first rows of the vil data frame correspond to the most attractive listings.